Since the beginning of the Landsat missions, the remote sensing community has been interested in developing universal algorithms for extracting water quality information from remotely sensed images [@Lots of old papers]. While the oceanic community has had significant success developing universal algorithms for chlorophyll, sediment, and dissolved organic carbon (DOC) [cites], there is no inland water equivalent. Much of this discrepancy comes from the greater optical complexity of inland waters, which makes a single universal algorithm harder to achieve, but progress on inland waters is further impeded by the lack of a shared dataset pairing satellite overpasses with in situ concentration measurements. Here we create and share the largest such overpass dataset ever assembled. We also outline and share our approach to bringing together three free, publicly available datasets to generate a high-graded, analysis-ready dataset for remote sensing of water quality. While a single universal algorithm may be an unattainable goal, we anticipate that this dataset will move us towards more universal approaches built on shared and equal access to overpass information.
Despite this long-recognized potential, the general hydrology and limnology communities have, until recently, not integrated remote sensing of inland waters into their research approach [Topp]. Instead, these communities have focused much of their research on Eulerian sampling schemes, with sensors or people repeatedly sampling the same points in a river or lake [DoyleEnsign]. This approach has generated a wealth of information on temporal variability in inland waters, but far less work has examined spatial variability in rivers, lakes, and estuaries. Remote estimates of water quality in these ecosystems would allow for rapid assessment of potential algal blooms, detection of high-sediment waters, and analysis of spatio-temporal variability [cites].
Serious citation of Topp, maybe none of this at all?
With the profusion of publicly available in situ water quality datasets and the relatively easily accessible satellite mission archive
| Satellite | Years | Available images |
|---|---|---|
| Landsat 5 | 1984-2012 | 192,688 |
| Landsat 7 | 1999-2018 | 188,781 |
| Landsat 8 | 2013-2018 | 58,585 |
For LAGOSNE data see here
Dataset generation
| type | chl_a | doc | secchi | tss |
|---|---|---|---|---|
| Estuary | 170,549 | 39,186 | 363,607 | 160,532 |
| Lake | 837,747 | 73,587 | 2,041,409 | 195,557 |
| Stream | 374,772 | 339,972 | 346,537 | 2,735,913 |
| Total | 1,383,068 | 452,745 | 2,751,553 | 3,092,002 |
| type | chl_a | doc | secchi | tss |
|---|---|---|---|---|
| Estuary | 28,940 | 5,826 | 47,956 | 26,883 |
| Lake | 128,453 | 9,545 | 350,428 | 26,864 |
| Stream | 22,403 | 16,353 | 35,101 | 54,297 |
| Total | 179,796 | 31,724 | 433,485 | 108,044 |
| name | rmse | count | mdae | mape |
|---|---|---|---|---|
| Suwannee | 1.40 | 34 | 0.999 | 0.268 |
| Niagara | 1.52 | 26 | 1.01 | 0.250 |
| Willamette | 1.58 | 22 | 1.38 | 0.447 |
| St. Clair | 2.25 | 31 | 1.73 | 0.463 |
| Wisconsin | 2.54 | 26 | 1.92 | 0.290 |
| Roanoke | 5.31 | 68 | 1.97 | 0.356 |
| Clark Fork | 5.69 | 32 | 3.15 | 0.626 |
| Tennessee | 6.34 | 108 | 3.75 | 0.503 |
| Pee Dee | 7.78 | 124 | 4.00 | 0.464 |
| St. Johns | 7.71 | 585 | 4.71 | 0.617 |
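The column names in the table above suggest three common error metrics. The sketch below shows one way they could be computed, assuming `rmse` is root mean squared error, `mdae` is median absolute error, and `mape` is mean absolute percent error expressed as a fraction. These definitions, the function names, and the sample values are illustrative assumptions, not taken from the analysis itself.

```python
import math
from statistics import median


def rmse(observed, predicted):
    """Root mean squared error (assumed meaning of the `rmse` column)."""
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mdae(observed, predicted):
    """Median absolute error (assumed meaning of the `mdae` column)."""
    return median(abs(p - o) for o, p in zip(observed, predicted))


def mape(observed, predicted):
    """Mean absolute percent error as a fraction (assumed meaning of `mape`)."""
    ratios = [abs(p - o) / abs(o) for o, p in zip(observed, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical observed and predicted values for a single river.
observed = [2.0, 4.0, 5.0, 10.0]
predicted = [3.0, 4.0, 4.0, 8.0]

print("rmse:", rmse(observed, predicted))
print("mdae:", mdae(observed, predicted))
print("mape:", mape(observed, predicted))
```

Under these definitions, RMSE weights large errors more heavily than the median absolute error does, so a river whose `rmse` is well above its `mdae` likely has a few large misses rather than uniformly poor fits.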
Importance of components:

| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 2.1670 | 1.5288 | 1.0672 | 0.67609 | 0.52972 | 0.24604 | 0.16113 | 0.05433 | 0.02650 |
| Proportion of Variance | 0.5218 | 0.2597 | 0.1265 | 0.05079 | 0.03118 | 0.00673 | 0.00288 | 0.00033 | 0.00008 |
| Cumulative Proportion | 0.5218 | 0.7815 | 0.9080 | 0.95880 | 0.98998 | 0.99671 | 0.99959 | 0.99992 | 1.00000 |
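The proportion-of-variance rows in the summary above follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the total variance. The short sketch below recomputes those rows from the reported standard deviations. The variable names are illustrative, and the values are copied from the summary rather than recomputed from the underlying data.

```python
from itertools import accumulate

# Standard deviations of the nine principal components, copied from the summary.
sdev = [2.1670, 1.5288, 1.0672, 0.67609, 0.52972, 0.24604, 0.16113, 0.05433, 0.02650]

# Variance explained by each component is the square of its standard deviation.
variances = [s ** 2 for s in sdev]
total_variance = sum(variances)

# Proportion of variance and its running (cumulative) sum.
proportion = [v / total_variance for v in variances]
cumulative = list(accumulate(proportion))

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: proportion={p:.5f} cumulative={c:.5f}")

print(f"total variance: {total_variance:.4f}")
```

The total variance comes out very close to 9, which is what one would expect if nine input variables were standardized to unit variance before the decomposition. That is an inference from the numbers, not something the output states.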
| cluster | chl_a | doc | tss | secchi |
|---|---|---|---|---|
| 1 | 4 | 8.37 | 3 | 3.66 |
| 2 | 8.01 | 4.23 | 10 | 1.52 |
| 3 | 10.7 | 4.3 | 21 | 0.8 |
| 4 | 7.64 | 5.5 | 7 | 2.44 |